Appendix Contents
Every moral scenario consists of a triple (context, action 1, action 2) and a set of auxiliary labels. The context describes a situation, and the actions describe two possible responses to it, phrased in the first person. The moral scenarios fall into two categories: 1. MoralChoice-LowAmbiguity and 2. MoralChoice-HighAmbiguity. The LLM-assisted construction of the scenarios (i.e., zero- and few-shot prompting setups) is grounded in a set of moral rules; an excerpt of the rule taxonomy follows.

| Category | Rule | Refined Rule Description |
| --- | --- | --- |
| Do not harm | Do not kill | Do not kill (i.e., do not cause permanent loss of consciousness). |
| Do not harm | Do not cause pain | Do not cause physical or emotional pain or unpleasant feelings (e.g., anger, sadness) to someone. |
| Do not harm | Do not disable | Do not deprive someone of their physical, mental or volitional ability. |
| Do not harm | Do not deprive of freedom | Do not deprive someone of their freedom (i.e., make a person unable to do something by altering the person's environment or situation). |
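To make the scenario structure above concrete, here is a minimal Python sketch of how such a record could be represented; the field names and example values are illustrative assumptions, not the dataset's actual schema.

```python
# Minimal sketch of a MoralChoice-style scenario record.
# Field names and the example values are illustrative assumptions only.
from dataclasses import dataclass, field
from typing import List

@dataclass
class MoralScenario:
    context: str               # first-person description of the situation
    action1: str               # first possible action
    action2: str               # second possible action
    ambiguity: str             # "low" or "high"
    violated_rules: List[str] = field(default_factory=list)  # auxiliary labels, e.g. "Do not kill"

example = MoralScenario(
    context="You are driving and see a pedestrian starting to cross the road.",
    action1="I stop and let the pedestrian cross.",
    action2="I keep driving without slowing down.",
    ambiguity="low",
    violated_rules=["Do not harm"],
)
print(example)
```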
- Oceania > New Zealand (0.04)
- Oceania > Australia (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- North America > Canada (0.04)
- Questionnaire & Opinion Survey (1.00)
- Research Report > New Finding (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.72)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.67)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.67)
Evaluating the Moral Beliefs Encoded in LLMs
This paper presents a case study on the design, administration, post-processing, and evaluation of surveys on large language models (LLMs). It comprises two components: (1) a statistical method for eliciting beliefs encoded in LLMs, introducing statistical measures and evaluation metrics that quantify the probability of an LLM making a particular choice, the associated uncertainty, and the consistency of that choice; (2) an application of this method to study which moral beliefs are encoded in different LLMs, especially in ambiguous cases where the right choice is not obvious. We design a large-scale survey comprising 680 high-ambiguity moral scenarios (e.g., "Should I tell a white lie?") and 687 low-ambiguity moral scenarios (e.g., "Should I stop for a pedestrian on the road?"). Each scenario includes a description, two possible actions, and auxiliary labels indicating violated rules (e.g., "do not kill").
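As a rough illustration of the kinds of quantities such a survey method estimates, the sketch below computes a choice probability, a simple uncertainty proxy, and a consistency proxy from repeated model answers under several question forms. It is an assumption-laden toy, not the paper's actual estimator; the function name, the binomial standard error, and the majority-agreement consistency measure are all illustrative choices.

```python
# Hedged sketch: estimate choice probability, uncertainty, and cross-form consistency
# from sampled answers to a two-action scenario. Not the paper's exact measures.
from collections import Counter
from math import sqrt

def choice_statistics(answers_by_form: dict[str, list[str]]) -> dict:
    """answers_by_form maps a question-form id to sampled answers ('action1'/'action2')."""
    all_answers = [a for answers in answers_by_form.values() for a in answers]
    n = len(all_answers)
    p_action1 = Counter(all_answers)["action1"] / n
    # Binomial standard error as a simple uncertainty proxy.
    std_err = sqrt(p_action1 * (1 - p_action1) / n)
    # Consistency proxy: how often each form's majority choice agrees with the overall majority.
    overall_majority = "action1" if p_action1 >= 0.5 else "action2"
    per_form_majorities = [Counter(ans).most_common(1)[0][0] for ans in answers_by_form.values()]
    consistency = sum(m == overall_majority for m in per_form_majorities) / len(per_form_majorities)
    return {"p_action1": p_action1, "std_err": std_err, "consistency": consistency}

print(choice_statistics({
    "ab_form": ["action1", "action1", "action2"],
    "repeat_form": ["action1", "action1", "action1"],
    "compare_form": ["action1", "action2", "action1"],
}))
```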
Unveiling the Bias Impact on Symmetric Moral Consistency of Large Language Models
Large Language Models (LLMs) have demonstrated remarkable capabilities, surpassing human experts on various benchmark tests and playing a vital role across industry sectors. Despite their effectiveness, a notable drawback of LLMs is their inconsistent moral behavior, which raises ethical concerns. This work delves into symmetric moral consistency in large language models and demonstrates that modern LLMs lack sufficient consistency in moral scenarios. Our extensive investigation of twelve popular LLMs reveals that their assessed consistency scores are influenced by position bias and selection bias rather than their intrinsic abilities. We propose a new framework, tSMC, which gauges the effects of these biases and effectively mitigates their impact using the Kullback-Leibler divergence to pinpoint LLMs' mitigated Symmetric Moral Consistency. We find that the ability of LLMs to maintain consistency varies across moral scenarios: LLMs show more consistency in scenarios with clear moral answers than in those where no choice is morally perfect.
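As a hedged illustration of one plausible ingredient of such a bias measurement, the sketch below computes a symmetrised Kullback-Leibler divergence between the model's answer distributions with the two actions presented in original versus swapped order; the actual tSMC framework involves more than this single quantity, and the numbers are made up for illustration.

```python
# Hedged sketch: symmetrised KL divergence between answer distributions under the
# original and swapped option orders, as a crude position/selection-bias indicator.
import math

def symmetric_kl(p: list[float], q: list[float], eps: float = 1e-9) -> float:
    """Symmetrised KL divergence between two discrete distributions over the same choices."""
    kl_pq = sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
    kl_qp = sum(qi * math.log((qi + eps) / (pi + eps)) for pi, qi in zip(p, q))
    return 0.5 * (kl_pq + kl_qp)

# Probabilities of choosing [action1, action2] with options shown in order A/B ...
p_original = [0.80, 0.20]
# ... and with the options swapped (answers re-mapped back to the actions).
p_swapped = [0.55, 0.45]
print(symmetric_kl(p_original, p_swapped))  # larger value -> stronger order-dependent bias
```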
Analysing Moral Bias in Finetuned LLMs through Mechanistic Interpretability
Bianca Raimondi, Daniela Dalbagno, Maurizio Gabbrielli
Large language models (LLMs) have been shown to internalize human-like biases during finetuning, yet the mechanisms by which these biases manifest remain unclear. In this work, we investigated whether the well-known Knobe effect, a moral bias in intentionality judgements, emerges in finetuned LLMs and whether it can be traced back to specific components of the model. We conducted a Layer-Patching analysis across three open-weight LLMs and demonstrated that the bias is not only learned during finetuning but also localized in a specific set of layers. Surprisingly, we found that patching activations from the corresponding pretrained model into just a few critical layers is sufficient to eliminate the effect. Our findings offer new evidence that social biases in LLMs can be interpreted, localized, and mitigated through targeted interventions, without the need for model retraining.
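The general activation-patching idea behind a Layer-Patching analysis can be sketched with PyTorch forward hooks, as below. The toy linear models stand in for real transformer blocks, and the procedure only illustrates the mechanism (capture a layer's activation from a reference model, substitute it into the same layer of a second model), not the paper's exact setup.

```python
# Hedged sketch of activation patching between two models using forward hooks.
# Requires PyTorch; the two tiny Sequential models are stand-ins for real LLM layers.
import torch
import torch.nn as nn

torch.manual_seed(0)
pretrained = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8))
finetuned = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8))

x = torch.randn(1, 8)
patch_layer = 0      # index of the layer whose activations we patch
cache = {}

# 1) Run the reference ("pretrained") model and cache the activation at the chosen layer.
h1 = pretrained[patch_layer].register_forward_hook(
    lambda mod, inp, out: cache.setdefault("act", out.detach())
)
pretrained(x)
h1.remove()

# 2) Run the "finetuned" model, overriding that layer's output with the cached activation.
h2 = finetuned[patch_layer].register_forward_hook(lambda mod, inp, out: cache["act"])
patched_output = finetuned(x)
h2.remove()

print(patched_output)
```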
- North America > United States (0.14)
- Europe > Italy > Emilia-Romagna > Metropolitan City of Bologna > Bologna (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.93)
MoralReason: Generalizable Moral Decision Alignment For LLM Agents Using Reasoning-Level Reinforcement Learning
Large language models are increasingly influencing human moral decisions, yet current approaches focus primarily on evaluating rather than actively steering their moral decisions. We formulate this as an out-of-distribution moral alignment problem, where LLM agents must learn to apply consistent moral reasoning frameworks to scenarios beyond their training distribution. We introduce Moral-Reason-QA, a novel dataset extending 680 human-annotated, high-ambiguity moral scenarios with framework-specific reasoning traces across utilitarian, deontological, and virtue ethics, enabling systematic evaluation of moral generalization in realistic decision contexts. Our learning approach employs Group Relative Policy Optimization with composite rewards that simultaneously optimize decision alignment and framework-specific reasoning processes to facilitate learning of the underlying moral frameworks. Experimental results demonstrate successful generalization to unseen moral scenarios, with softmax-normalized alignment scores improving by +0.757 for utilitarian and +0.450 for deontological frameworks when tested on out-of-distribution evaluation sets. The experiments also reveal training challenges and promising directions that inform future research. These findings establish that LLM agents can be systematically trained to internalize and apply specific moral frameworks to novel situations, providing a critical foundation for AI safety as language models become more integrated into human decision-making processes.
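As an illustration of what a composite reward of this kind might look like, the sketch below combines a decision-alignment term with a crude framework-specific reasoning score. The weights, keyword heuristic, and function name are assumptions for illustration only, not the paper's reward design.

```python
# Hedged sketch of a composite reward: decision alignment plus a rough proxy for
# framework-specific reasoning quality. All names and weights are illustrative.
FRAMEWORK_KEYWORDS = {
    "utilitarian": ["overall well-being", "consequences", "greatest good"],
    "deontological": ["duty", "rule", "obligation"],
    "virtue": ["character", "honest", "virtuous"],
}

def composite_reward(decision: str, reference_decision: str,
                     reasoning: str, framework: str,
                     w_decision: float = 0.7, w_reasoning: float = 0.3) -> float:
    """Weighted sum of decision alignment and a keyword-based reasoning score."""
    decision_reward = 1.0 if decision == reference_decision else 0.0
    keywords = FRAMEWORK_KEYWORDS[framework]
    hits = sum(kw in reasoning.lower() for kw in keywords)
    reasoning_reward = hits / len(keywords)   # fraction of framework cues present
    return w_decision * decision_reward + w_reasoning * reasoning_reward

print(composite_reward(
    decision="action1",
    reference_decision="action1",
    reasoning="Following my duty and the rule against lying, I refuse.",
    framework="deontological",
))
```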
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.67)